CRFsuite Tutorial
Task description
In this example, NP stands for a noun phrase, VP for a verb phrase, and PP for a prepositional phrase.
Sequential labeling task (sequence labeling)
The goal of this tutorial is to build a model that predicts chunk labels for a given sentence (sequence of tokens) by using CRFsuite.
Training and testing data
Example: London JJ B-NP
The data consists of a set of sentences (sequences) each of which contains a series of words (e.g., 'London', 'shares'), part-of-speech tags (e.g., 'JJ', 'NNS'), and chunk labels (e.g., 'B-NP', 'I-NP') separated by space characters.
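The data layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes that sentences are separated by blank lines (as in CoNLL-style chunking files), the function name read_sentences is made up for illustration, and the second sentence in the sample is illustrative data rather than a quotation from the corpus.

```python
def read_sentences(lines, separator=" "):
    """Group 'word POS chunk' lines into sentences.

    Assumption: a blank line marks the end of a sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # Each line holds exactly three space-separated fields.
        word, pos, chunk = line.split(separator)
        current.append((word, pos, chunk))
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample: two short sentences separated by a blank line.
sample = [
    "London JJ B-NP",
    "shares NNS I-NP",
    "",
    "He PRP B-NP",
    "reckons VBZ B-VP",
]
sentences = read_sentences(sample)
for sentence in sentences:
    print(sentence)
```

Each sentence becomes a list of (word, part-of-speech, chunk label) tuples, which is a convenient shape for the attribute generation step.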
The scripts needed for this tutorial are included under the example directory in the CRFsuite distribution.
In this tutorial, we would like to construct a CRF model that assigns a sequence of chunk labels, given a sequence of words and part-of-speech codes.
Feature (attribute) generation
In general, this is the most important process for machine-learning approaches because feature design greatly affects the labeling accuracy.
In this tutorial, we extract 19 kinds of attributes from a word at position t (in offsets from the beginning of a sequence):
Words within two positions of the current token (w[t-2], w[t-1], w[t], w[t+1], w[t+2])
Word bigrams (w[t-1]|w[t], w[t]|w[t+1])
Part-of-speech tags within two positions of the current token (pos[t-2], pos[t-1], pos[t], pos[t+1], pos[t+2])
Part-of-speech bigrams (pos[t-2]|pos[t-1], pos[t-1]|pos[t], pos[t]|pos[t+1], pos[t+1]|pos[t+2])
Part-of-speech trigrams (pos[t-2]|pos[t-1]|pos[t], pos[t-1]|pos[t]|pos[t+1], pos[t]|pos[t+1]|pos[t+2])
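The 19 attribute kinds listed above (5 + 2 + 5 + 4 + 3) can be generated with a short function. The sketch below is not the code shipped with CRFsuite. The helper names ngram and extract_attributes are hypothetical, attribute names use offsets relative to the current token, and the sketch assumes that attributes reaching past either end of the sentence are simply skipped. The sample sentence is made up.

```python
def ngram(seq, name, t, offsets):
    """Build one attribute such as 'pos[-1]|pos[0]=DT|JJ'.

    Returns None when any referenced position lies outside the sequence.
    """
    positions = [t + o for o in offsets]
    if min(positions) < 0 or max(positions) >= len(seq):
        return None
    key = "|".join(f"{name}[{o}]" for o in offsets)
    value = "|".join(seq[p] for p in positions)
    return f"{key}={value}"


def extract_attributes(words, tags, t):
    """Return the attributes available for the token at position t."""
    specs = []
    specs += [("w", words, (o,)) for o in range(-2, 3)]                # 5 word unigrams
    specs += [("w", words, (o, o + 1)) for o in (-1, 0)]               # 2 word bigrams
    specs += [("pos", tags, (o,)) for o in range(-2, 3)]               # 5 POS unigrams
    specs += [("pos", tags, (o, o + 1)) for o in range(-2, 2)]         # 4 POS bigrams
    specs += [("pos", tags, (o, o + 1, o + 2)) for o in range(-2, 1)]  # 3 POS trigrams
    attrs = [ngram(seq, name, t, offsets) for name, seq, offsets in specs]
    return [a for a in attrs if a is not None]


# Made-up five-token sentence, used only to exercise the function.
words = ["The", "quick", "fox", "jumps", "high"]
tags = ["DT", "JJ", "NN", "VBZ", "RB"]

middle = extract_attributes(words, tags, 2)
first = extract_attributes(words, tags, 0)
print(len(middle), len(first))
```

Only a token with at least two neighbours on each side receives all 19 attributes. Tokens near a sentence boundary receive fewer under the skipping assumption made here.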
CRFsuite will learn associations between these attributes (e.g., "pos[0]|pos[1]|pos[2]=DT|JJ|NN") and labels (e.g., "B-NP") to predict a label sequence for a given text.
The "name=value" convention is merely a convenience that makes attribute names easier to interpret.
CRFsuite accepts any string as an attribute name as long as the string does not contain a colon character.
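Because a colon is not allowed in an attribute name, an extractor has to deal with tokens such as "10:30" before writing them out. The sketch below shows one simple way to do so by substituting a placeholder. The placeholder string __COLON__, the helper names, and the sample token are illustrative choices, and the tab-separated layout with the label in the first column is an assumption about the data file format that should be checked against the CRFsuite manual.

```python
def escape(attribute):
    """Replace colons so the string is usable as an attribute name."""
    return attribute.replace(":", "__COLON__")


def format_item(label, attributes):
    """Join a label and its attributes into one tab-separated line.

    Assumption: the label comes first, followed by the attributes.
    """
    return "\t".join([label] + [escape(a) for a in attributes])


# Made-up token whose surface form contains a colon.
line = format_item("B-NP", ["w[0]=10:30", "pos[0]=CD"])
print(line)
```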
crfsuite learn
You can also train a CRF model, watching its performance (accuracy, precision, recall, f1 score) evaluated on the test data.
crfsuite learn -e2
crfsuite tag
Dumping the model file
crfsuite dump
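The three subcommands named in this tutorial (learn, tag, dump) can also be driven from a Python script. The sketch below only builds the argument lists and does not run anything. The file names are hypothetical, the helper functions are made up for illustration, and -m is used here as the option that names the model file. Pass a resulting list to subprocess.run to execute it once the crfsuite binary is installed.

```python
def learn_command(train_file, model_file=None, test_file=None):
    """Build a 'crfsuite learn' command line.

    With test_file, -e2 asks for evaluation on the second data set.
    """
    cmd = ["crfsuite", "learn"]
    if model_file:
        cmd += ["-m", model_file]
    if test_file:
        cmd += ["-e2", train_file, test_file]
    else:
        cmd.append(train_file)
    return cmd


def tag_command(model_file, data_file):
    """Build a 'crfsuite tag' command line for an existing model."""
    return ["crfsuite", "tag", "-m", model_file, data_file]


def dump_command(model_file):
    """Build a 'crfsuite dump' command line."""
    return ["crfsuite", "dump", model_file]


# Hypothetical file names, chosen only for this example.
commands = [
    learn_command("train.crfsuite.txt", model_file="chunking.model"),
    learn_command("train.crfsuite.txt", test_file="test.crfsuite.txt"),
    tag_command("chunking.model", "test.crfsuite.txt"),
    dump_command("chunking.model"),
]
for cmd in commands:
    print(" ".join(cmd))
```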
Notes on writing attribute extractors
The common tasks (attribute generation from templates, data I/O, etc.) are implemented in other modules.
separator character(s) of the input data
field name(s) (ordered from left to right) of the input data, separated by a space character
Three columns: w, pos, y
London(=w) JJ(=pos) B-NP(=y)
attribute (feature) templates written as a Python tuple/list object
a tuple/list of (name, offset) pairs, in which name represents a field name, and offset represents an offset to the current position.
(('w', -2), ), which denotes w[t-2]
(('w', -1), ('w', 0)),
the bigram starting at the previous token
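The three ingredients described above (separator, field names, and templates) are enough to drive a small generic extractor. The sketch below is an illustration of the idea, not the code of the modules shipped with CRFsuite. The function names parse_line and apply_templates are hypothetical, templates that reach outside the sentence are assumed to produce no attribute, and the third sample line is illustrative data.

```python
separator = " "
fields = "w pos y"
templates = (
    (("w", -2),),           # w[t-2]
    (("w", -1), ("w", 0)),  # the bigram starting at the previous token
)


def parse_line(line):
    """Split one input line into a dict keyed by field name."""
    return dict(zip(fields.split(" "), line.split(separator)))


def apply_templates(items, templates):
    """Return, for each position, the attributes produced by the templates."""
    result = []
    for t in range(len(items)):
        attrs = []
        for template in templates:
            positions = [t + offset for _, offset in template]
            # Assumption: skip templates that fall outside the sentence.
            if any(p < 0 or p >= len(items) for p in positions):
                continue
            name = "|".join(f"{field}[{offset}]" for field, offset in template)
            value = "|".join(items[t + offset][field] for field, offset in template)
            attrs.append(f"{name}={value}")
        result.append(attrs)
    return result


lines = ["London JJ B-NP", "shares NNS I-NP", "closed VBD B-VP"]
items = [parse_line(line) for line in lines]
attributes = apply_templates(items, templates)
for item, attrs in zip(items, attributes):
    print(item["y"], attrs)
```

Keeping the templates as plain data means that a new attribute can be added by appending one tuple, without touching the extraction loop.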
Feature extractors for other tasks